Template 5:
Welcome! This template will guide you through a Bayesian analysis in R, even if you have never done Bayesian analysis before. It is one of a set of templates, each for a different type of analysis. This template is for data with two categorical independent variables and will produce a line graph. If your analysis includes a two-way ANOVA, this might be the right template for you. In most cases, we do not recommend using line charts for this type of analysis; a bar chart is usually the better option.
This template assumes you have basic familiarity with R. Once complete, this template will produce a summary of the analysis, complete with parameter estimates and credible intervals, and two animated hypothetical outcome plots (HOPs; see Hullman, Resnick, Adar 2015 DOI: 10.1371/journal.pone.0142444 and Kale, Nguyen, Kay, and Hullman VIS 2018 for more information), one for your prior estimates and one for your posterior estimates.
This Bayesian analysis focuses on producing results in a form that is easily interpretable, even to nonexperts. The credible intervals produced by Bayesian analysis are the analogue of confidence intervals in traditional null hypothesis significance testing (NHST). A weakness of NHST confidence intervals is that they are easily misinterpreted [sources for all of this]. Many people naturally interpret an NHST 95% confidence interval to mean that there is a 95% chance that the true parameter value lies somewhere in that interval; in fact, it means that if the experiment were repeated many times, about 95% of the resulting confidence intervals would include the true parameter value. The Bayesian credible interval sidesteps this complication by providing the intuitive meaning: given the model and priors, a 95% chance that the true parameter value lies somewhere in that interval. To further support intuitive interpretations of your results, this template also produces animated HOPs, a type of plot that is more effective than visualizations such as error bars in helping people make accurate judgments about probability distributions.
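The repeated-experiment reading of a confidence interval can be checked with a small simulation. The sketch below is not part of the template's analysis; the true mean, sample size, and number of replications are made-up values chosen only for illustration.

```r
# Simulate many repeated experiments and count how often the 95%
# confidence interval for the mean contains the true mean.
set.seed(2024)                 # arbitrary seed, for reproducibility
true_mean <- 5                 # hypothetical "true" parameter value
n_experiments <- 2000          # number of simulated replications
n_per_experiment <- 30         # sample size in each replication

covers <- replicate(n_experiments, {
  y <- rnorm(n_per_experiment, mean = true_mean, sd = 1)
  ci <- t.test(y, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Long-run share of intervals that contain true_mean
coverage <- mean(covers)
print(coverage)
```

The printed share should be close to 0.95. Note that this is a statement about the procedure across replications, not about any single interval.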
This set of templates supports a few types of statistical analysis. (In future work, this list of supported statistical analyses will be expanded.) For clarity, each type has been broken out into a separate template, so be sure to select the right template before you start! A productive way to choose which template to use is to think about what type of chart you would like to produce to summarize your data. Currently, the templates support the following:
One independent variable:
Categorical; bar graph (e.g. t-tests, one-way ANOVA)
Ordinal; line graph (e.g. t-tests, one-way ANOVA)
Continuous; line graph (e.g. linear regression)
Two independent variables:
Two categorical; bar graph (e.g. two-way ANOVA)
One categorical, one ordinal; line graph (e.g. two-way ANOVA)
One categorical, one continuous; line graph (e.g. linear regression with multiple lines)
Note that this template fits your data to a model that assumes normally distributed error terms. (This is the same assumption underlying t-tests, ANOVA, etc.) This template requires you to have already run diagnostics to determine that your data is consistent with this assumption; if you have not, the results may not be valid.
Once you have selected your template, please follow along with it to complete the analysis. For each code chunk, you may need to make changes to customize the code for your own analysis. In those places, the code chunk will be preceded by a list of things you need to change (with the heading “What to change”), and each line that needs to be customized will also include the comment #CHANGE ME within the code chunk itself. You can run each code chunk independently during debugging; when you’re finished, you can knit the document to produce the complete document.
Good luck!
This template comes prefilled with an example dataset from Moser et al. (DOI: 10.1145/3025453.3025778), which examines choice overload in the context of e-commerce. The study examined the relationship between choice satisfaction (measured on a 7-point Likert scale), the number of product choices presented on a webpage, and whether the participant is a decision “maximizer” (a person who examines all options and tries to choose the best) or a “satisficer” (a person who selects the first option that is satisfactory). In this template, we analyze the relationship between three variables: choice set size, which we treat as an ordinal variable with possible values [12,24,40,50,60,72]; type of decision-making (maximizer or satisficer), a two-level categorical variable; and choice satisfaction, which we treat as a continuous variable with values that can fall in the range [1,7].
If this is your first time using the template, you may need to install libraries. Uncomment the lines below - install.packages() and devtools::install_github() - to install the required packages. This only needs to be done once.
knitr::opts_chunk$set(fig.align="center")
# Cache results of code chunks to speed up repeated knitting
# This can cause problems in this analysis
# Only keep this enabled once you have read in your dataset, set variables, and set priors;
# until then, comment out the line below
knitr::opts_chunk$set(cache = TRUE)
# install.packages(c("rstanarm", "tidyverse", "tidybayes", "modelr", "devtools"))
# devtools::install_github("thomasp85/gganimate")
library(rstanarm) #bayesian analysis package
library(tidyverse) #tidy datascience commands
library(tidybayes) #tidy data + ggplot workflow
library(modelr) #tidy pipelines when modelling
library(ggplot2) #plotting package
library(gganimate) #animate ggplots
theme_set(theme_light()) # set the ggplot theme for all plots
What to change
mydata = read.csv('datasets/choc_cleaned_data.csv') #CHANGE ME 1
We’ll fit the following model: stan_glm(y ~ x_1 * x_2), where both \(x_1\) and \(x_2\) are categorical variables. This specifies a linear regression with dummy variables for each level in \(x_1\) and \(x_2\), plus interaction terms for each combination of \(x_1\) and \(x_2\). This is equivalent to ANOVA. So for example, for a regression where \(x_1\) has three levels and \(x_2\) has two levels, each \(y_i\) is drawn from a normal distribution with mean equal to \(a + ...\) and standard deviation equal to sigma (\(\sigma\)):
\[ \begin{aligned} y_i \sim \mathrm{Normal}(&a + b_{x1a}\,dummy_{x1a} + b_{x1b}\,dummy_{x1b} + b_{x2}\,dummy_{x2} \\ &+ b_{x1a:x2}\,dummy_{x1a}\,dummy_{x2} + b_{x1b:x2}\,dummy_{x1b}\,dummy_{x2},\ \sigma) \end{aligned} \]
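To see which dummy and interaction columns a formula of this shape generates, one can inspect the design matrix directly with base R. The factor names and level labels below are hypothetical placeholders for a three-level \(x_1\) and a two-level \(x_2\); this is a sketch of R's default treatment coding, not output from the template's model.

```r
# Hypothetical factors: x1 with three levels, x2 with two levels.
# The first level of each factor acts as the reference category.
design <- expand.grid(
  x1 = factor(c("ref", "a", "b"), levels = c("ref", "a", "b")),
  x2 = factor(c("ref", "other"), levels = c("ref", "other"))
)

# Design matrix for main effects plus interaction
X <- model.matrix(~ x1 * x2, data = design)
print(colnames(X))
print(X)
```

The six columns correspond to one intercept, two dummies for \(x_1\), one dummy for \(x_2\), and two interaction columns, each of which is the product of an \(x_1\) dummy and the \(x_2\) dummy.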
Choose your independent and dependent variables. These are the variables that will correspond to the x- and y-axes on the final plots.
What to change
mydata$x1: Select which variable will appear on the x-axis of your plots.
mydata$x2: Select the second independent variable that will group your data; think of this as the group aesthetic in ggplot.
mydata$y: Select which variable will appear on the y-axis of your plots.
x_lab: Label your plots’ x-axes.
y_lab: Label your plots’ y-axes.
#select your independent and dependent variables
mydata$x1 = as.factor(mydata$num_products_displayed) #CHANGE ME 2
mydata$x2 = mydata$sat_max #CHANGE ME 3
mydata$y = mydata$satis_Q1 #CHANGE ME 4
# label the axes on the plots
x_lab = "Choices" #CHANGE ME 5
y_lab = "Satisfaction" #CHANGE ME 6
In this section, you will set priors for your model. Setting priors thoughtfully is important to any Bayesian analysis, especially if you have only a small sample of data with which to fit your model. The priors express your best belief, before seeing any data, about reasonable values for the model parameters. The model estimation process then produces a posterior distribution: beliefs about the plausible values for the parameters, given the data in your dataset.
Ideally, you will have previous literature from which to draw these prior beliefs. If no previous studies exist, you can instead assign “weakly informative priors” that only minimally restrict the model; for example, a weakly informative prior for a parameter that can only have values between 1 and 7 would assign a very small probability to values outside of that range. We have provided an example of how to set priors below.
To check the plausibility of your priors, use the code section after this one to generate a graph of 100 sample draws from your priors to check if the values generated are reasonable.
Our model has the following parameters:
a. the overall mean y-value across all levels of categorical variable x
b. the mean y-value for each of the individual levels
c. the standard deviation of the normally distributed error term
To simplify things when there are more than two levels for the x-variable, we limit the number of different prior beliefs you can have for the means at different x-levels. Think of the first level of the categorical variable as specifying the control condition of an experiment, and all of the other levels being treatment conditions in the experiment. We let you specify a prior belief about the plausible values of mean in the control condition, and then we let you set a prior belief about the plausible effect size. You have to specify the same plausible effect sizes for all conditions, unless you dig deeper into our code than the few spots we’ve told you to change.
To simplify things further, we only let you specify beliefs about these parameters in the form of a normal distribution. Thus, you will specify what you think is the most likely value for the parameter (the mean), and a standard deviation. You will be expressing a belief that you are about 95% certain (before looking at any data) that the true value of the parameter is within two standard deviations of the mean.
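The "two standard deviations" rule of thumb can be verified with base R's normal distribution functions. The sketch below also applies it to the example values used in this template's prior chunk (mean 4, standard deviation 1.5, on a scale from 1 to 7); the variable names are introduced here only for illustration.

```r
# Probability mass within two standard deviations of the mean of a normal
within_2sd <- pnorm(2) - pnorm(-2)

# Example values from this template: prior mean 4, prior sd 1.5, scale [1, 7]
prior_mean <- 4
prior_sd <- 1.5

# Prior probability assigned to impossible values below 1 or above 7
outside_scale <- pnorm(1, prior_mean, prior_sd) +
  pnorm(7, prior_mean, prior_sd, lower.tail = FALSE)

print(round(within_2sd, 3))     # 0.954
print(round(outside_scale, 3))  # 0.046
```

With these values, the impossible region lies exactly two standard deviations from the mean on each side, so it receives a little under 5% of the prior probability.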
Finally, our modeling function, stan_glm(), will automatically set a prior for the last parameter, the standard deviation of the normally distributed error term for the model overall. That is almost never a model parameter that you will be interpreting, and Stan does a reasonable job of assigning a weakly informative prior to it in a way that won’t affect the estimation of the other parameters, which we normally do care about.
What to change
a_prior: Select the control condition mean.
a_sd: Select the control condition standard deviation.
b1_prior: Select the effect size mean.
b1_sd: Select the effect size standard deviation.
You should also change the comments in the code below to explain your choice of priors.
# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (11)
# In our example dataset, y-axis scores can be in the range [1, 7].
# Thus, a mean value in the control condition of less than 1 or greater
# than 7 is impossible. With a normal distribution, we can't completely
# rule out those impossible values, but we choose a mean and sd that assign
# less than 5% probability to those impossible values.
# We select the mean of the range [1,7], and an sd that assigns a
# 95% probability to values that vary up to +3/-3 from this mean.
a_prior = 4 # CHANGE ME 7
a_sd = 1.5 # CHANGE ME 8
# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (11)
# In our example dataset, we do not have a strong hypothesis that the treatment
# conditions will be higher or lower than the control, so we set the mean of
# the effect size parameters to be 0. In the absence of other information, we
# set the sd to be the same as for the control condition.
b1_prior = 0 # CHANGE ME 9
b1_sd = 1.5 # CHANGE ME 10
Next, you’ll want to check your priors by running this code chunk. It will produce a set of 100 sample plots drawn from the priors you set in the previous section, so you can check to see if the values generated are reasonable. (We’ll go into the details of this code later.)
What to change
Nothing! Just run this code to check your priors, adjusting prior values above as needed until you find reasonable prior values. Note that you may get a couple of very implausible or even impossible values because our assumption of normally distributed priors assigns a small probability to even very extreme values. If you are concerned by the outcome, you can try rerunning it a few more times to make sure that any implausible values you see don’t come up very often.
#generate sample draws from the priors
m_prior = stan_glm(y ~ x1*x2, data = mydata,
prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
prior = normal(b1_prior, b1_sd, autoscale = FALSE),
prior_PD = TRUE
)
#create the dataframe for fitted draws & plot the sample draws
mydata %>%
data_grid(x1, x2) %>%
add_fitted_draws(m_prior, n = 100, seed = 12345) %>%
ggplot(aes(x = x1, y = .value, col = x2, group = x2)) +
geom_line(aes(group = .draw), alpha = .2) +
facet_grid(cols = vars(x2)) +
# coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
legend.position="none") +
labs(x=x_lab, y=y_lab) + # axes labels
ggtitle("100 sample draws from the priors")
There’s nothing you have to change here. Just run the model.
m = stan_glm(y ~ x1*x2, data = mydata,
prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
prior = normal(b1_prior, b1_sd, autoscale = FALSE)
)
Here is a summary of the model fit:
summary(m, digits=3)
##
## Model Info:
##
## function: stan_glm
## family: gaussian [identity]
## formula: y ~ x1 * x2
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## observations: 611
## predictors: 12
##
## Estimates:
## mean sd 2.5% 25% 50% 75%
## (Intercept) 6.116 0.122 5.875 6.031 6.117 6.201
## x124 -0.134 0.179 -0.483 -0.254 -0.130 -0.015
## x140 0.173 0.176 -0.168 0.051 0.171 0.291
## x150 -0.118 0.164 -0.436 -0.231 -0.119 -0.004
## x160 -0.182 0.170 -0.523 -0.294 -0.184 -0.065
## x172 -0.015 0.176 -0.361 -0.136 -0.011 0.103
## x2satisficer 0.189 0.189 -0.188 0.065 0.194 0.315
## x124:x2satisficer -0.093 0.292 -0.672 -0.284 -0.093 0.102
## x140:x2satisficer -0.118 0.284 -0.671 -0.307 -0.115 0.074
## x150:x2satisficer 0.130 0.281 -0.422 -0.058 0.127 0.318
## x160:x2satisficer 0.082 0.280 -0.460 -0.107 0.082 0.269
## x172:x2satisficer -0.022 0.275 -0.562 -0.208 -0.021 0.161
## sigma 1.024 0.030 0.967 1.003 1.023 1.045
## mean_PPD 6.132 0.058 6.018 6.093 6.132 6.172
## log-posterior -895.530 2.552 -901.523 -896.989 -895.231 -893.683
## 97.5%
## (Intercept) 6.349
## x124 0.217
## x140 0.519
## x150 0.195
## x160 0.152
## x172 0.329
## x2satisficer 0.569
## x124:x2satisficer 0.487
## x140:x2satisficer 0.436
## x150:x2satisficer 0.674
## x160:x2satisficer 0.620
## x172:x2satisficer 0.525
## sigma 1.086
## mean_PPD 6.246
## log-posterior -891.469
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.003 1.003 1253
## x124 0.004 1.001 2040
## x140 0.004 1.002 1876
## x150 0.004 1.002 1615
## x160 0.004 1.002 1615
## x172 0.004 1.002 1817
## x2satisficer 0.006 1.003 1083
## x124:x2satisficer 0.007 1.001 1731
## x140:x2satisficer 0.007 1.001 1861
## x150:x2satisficer 0.006 1.002 1904
## x160:x2satisficer 0.006 1.000 1963
## x172:x2satisficer 0.006 1.002 1798
## sigma 0.000 1.001 4000
## mean_PPD 0.001 1.001 4000
## log-posterior 0.065 1.002 1563
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
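The diagnostics table can also be screened programmatically. The helper below is a hypothetical sketch, not a function from rstanarm: it takes any table with `Rhat` and `n_eff` columns and returns the names of parameters that fall outside rule-of-thumb limits. The thresholds are assumptions (an Rhat cutoff of 1.1 is a common convention, and the `n_eff` floor is arbitrary) and are not set by this template.

```r
# Flag parameters whose diagnostics fall outside rule-of-thumb limits.
# `diag` is any matrix or data frame with columns "Rhat" and "n_eff".
# Both thresholds are illustrative defaults, not values from the template.
flag_diagnostics <- function(diag, rhat_max = 1.1, n_eff_min = 400) {
  diag <- as.data.frame(diag)
  bad <- diag$Rhat > rhat_max | diag$n_eff < n_eff_min
  rownames(diag)[bad]
}

# A few rows copied by hand from the printed summary above
example_diag <- data.frame(
  Rhat  = c(1.003, 1.003, 1.001),
  n_eff = c(1253, 1083, 4000),
  row.names = c("(Intercept)", "x2satisficer", "sigma")
)

flagged <- flag_diagnostics(example_diag)
print(flagged)  # an empty character vector means nothing was flagged
```

With a fitted model, passing `summary(m)` to the helper should work in the same way, assuming the summary object converts to a data frame that keeps its `Rhat` and `n_eff` columns; that usage is untested here.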
To plot the results, we will first construct a fit grid: a data frame of points at which we want to calculate a value from the model. The data_grid function allows us to do this easily, e.g. by asking for a point corresponding to every combination of values of the x and group variables in our original data:
mydata %>%
data_grid(x1, x2)
## # A tibble: 12 x 2
## x1 x2
## <fct> <fct>
## 1 12 maximizer
## 2 12 satisficer
## 3 24 maximizer
## 4 24 satisficer
## 5 40 maximizer
## 6 40 satisficer
## 7 50 maximizer
## 8 50 satisficer
## 9 60 maximizer
## 10 60 satisficer
## 11 72 maximizer
## 12 72 satisficer
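For intuition, a grid like the one printed above can be reproduced with base R alone. The sketch below uses a small made-up data frame in place of `mydata` and `expand.grid()` in place of `data_grid()`; it illustrates the idea of "every combination of observed levels" and is not how the template itself builds the grid.

```r
# Hypothetical stand-in for the two factor columns in mydata
toy <- data.frame(
  x1 = factor(c(12, 24, 40, 50, 60, 72, 12, 24)),
  x2 = factor(c("maximizer", "satisficer", "maximizer", "satisficer",
                "maximizer", "satisficer", "satisficer", "maximizer"))
)

# One row for every combination of the levels of x1 and x2
grid <- expand.grid(x1 = levels(toy$x1), x2 = levels(toy$x2))
grid <- grid[order(grid$x1, grid$x2), ]  # same row order as the tibble above

print(nrow(grid))
print(head(grid, 4))
```

Six levels of `x1` crossed with two levels of `x2` give twelve rows, one per cell of the design.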
Given this fit grid, we can then create any number of visualizations of the results. One way we might want to visualize the results is a static graph with error bars that represent +1/-1 standard deviation. For each x position in the fit grid, we can get the posterior mean estimates and standard deviations from the model:
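The template's own chunk for this step is still a placeholder, so the following is only a minimal sketch of the summarizing step, under stated assumptions. It uses made-up draws in a data frame shaped like the output of add_fitted_draws (columns `x1`, `x2`, `.draw`, `.value`) and computes the mean and the mean plus or minus one standard deviation for each cell with base R.

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Made-up draws for illustration only: 200 draws for each of four cells
draws <- expand.grid(
  .draw = 1:200,
  x1 = c("12", "24"),
  x2 = c("maximizer", "satisficer")
)
draws$.value <- rnorm(nrow(draws), mean = 6, sd = 0.2)

# Mean and standard deviation of the draws in every cell of the grid
cell_mean <- aggregate(draws$.value, by = draws[c("x1", "x2")], FUN = mean)
cell_sd   <- aggregate(draws$.value, by = draws[c("x1", "x2")], FUN = sd)

# Error-bar endpoints at +1/-1 standard deviation around the mean
cell_summary <- data.frame(
  cell_mean[c("x1", "x2")],
  mean  = cell_mean$x,
  lower = cell_mean$x - cell_sd$x,
  upper = cell_mean$x + cell_sd$x
)
print(cell_summary)
```

With real posterior draws in place of the simulated ones, a table like `cell_summary` could be passed to ggplot with `geom_line()` and `geom_errorbar(aes(ymin = lower, ymax = upper))` to draw the static graph.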
#TODO CHANDA - THIS IS ALL BROKEN.
#Come back to this once Matt has looked at the simple bar chart - he might have a solution that will transfer easily to this
Even better would be to animate this graph using HOPs (hypothetical outcome plots), a type of plot that visualizes uncertainty as sets of draws from a distribution, which has been demonstrated to improve multivariate probability estimates (Hullman et al. 2015) and increase sensitivity to the underlying trend in data (Kale et al. 2018) over static representations of uncertainty like error bars.
To set up the HOPs plots, we will first set the aesthetics for the ggplot that we will use.
You can set the aesthetics of your HOPs plots here.
What to change
In most cases, the default values here should be just fine. If you want to adjust the aesthetics of the animated plots later, you can do so here using the ggplot2 package; just be sure to keep the lines that are commented with “do not change.” Below are two optional code customizations that we think may be particularly useful for some datasets.
[Optional] coord_cartesian(ylim = …): You may want to manually set the y-axis limits. If so, uncomment this line in the code below and set your preferred limits accordingly.
[Optional] scale_x_discrete(limits = …): You may want to manually set the order of the x-axis levels; for example, if you have levels “before” and “after,” ggplot defaults often plot “after” on the left and “before” on the right. If so, uncomment this line in the code below and set your preferred level order. The names of the levels must match what is in your dataset.
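The reason “after” tends to be plotted before “before” is that R sorts factor levels alphabetically by default. The sketch below demonstrates this with hypothetical condition labels; setting the levels explicitly in the data is an alternative to scale_x_discrete(limits = …) and is offered here only as an illustration.

```r
# Hypothetical condition labels, in the order they were collected
condition <- c("before", "after", "before", "after")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(condition))

# Explicit: levels follow the order given in `levels`
ordered_levels <- levels(factor(condition, levels = c("before", "after")))

print(default_levels)  # "after"  "before"
print(ordered_levels)  # "before" "after"
```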
# the default code for the plots - if needed, the animated plot aesthetics can be customized here
graph_plot <- function(data) {
ggplot(data, aes(x = x1, y = .value, col = x2, group = x2)) + #do not change
geom_line() + #do not change
transition_states(.draw, transition_length = 1, state_length = 1) + # gganimate code to animate the plots. Do not change
coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + #rotates the x-axis text for readability
# scale_x_discrete(limits=c("before","after")) + #manually set the order of the x-axis levels
labs(x=x_lab, y=y_lab) # axes labels
}
# Animation parameters
n_draws = 100 # the number of draws to visualize in the HOPs plots
frames_per_second = 2.5 # the speed of the HOPs
# 2.5 frames per second (400ms) is the recommended speed for the HOPs visualization.
# Faster speeds (100ms) have been demonstrated to not work as well.
# See Kale et al. VIS 2018 for more info.
Now that the plot aesthetics are set, we can return to our fit grid and repeatedly draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function. Each frame of the animation shows a different draw from the posterior:
p <- mydata %>% #pipe mydata to datagrid()
data_grid(x1, x2) %>% #create a fit grid with each level in x, and pipe it to add_fitted_draws()
add_fitted_draws(m, n = n_draws, seed = 12345) #add n fitted draws from the model to the fit grid
# the seed argument is for reproducibility: it ensures the pseudo-random
# number generator used to pick draws has the same seed on every run,
# so that someone else can re-run this code and verify their output matches
#animate the data from p, using the graph aesthetics set in the graph aesthetics code chunk
animate(graph_plot(p), nframes = n_draws * 2, fps = frames_per_second)
We already looked at some sample plots of the priors when we were setting priors; now we want to look at these priors again, but in a HOPs format so we can compare to the posterior plots. To get the prior plots, we can simply ask stan_glm to sample from the prior.
What to change
If you are knitting this document, or if you already ran the code in the “Check priors” section that calculates m_prior, you can comment out this line:
#m_prior = update(m, prior_PD = TRUE)
Then our code to generate plots is identical, except we replace m with m_prior:
p_prior = mydata %>%
data_grid(x1, x2) %>%
add_fitted_draws(m_prior, n = n_draws, seed = 12345)
animate(graph_plot(p_prior), nframes = n_draws * 2, fps = frames_per_second)
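As a quick check on the animation settings used in both animate() calls, the arithmetic below converts the frame rate into a per-frame duration and a total running time. It reuses the values set in the animation parameters chunk; `n_frames`, `ms_per_frame`, and `total_seconds` are names introduced here for illustration.

```r
n_draws <- 100            # draws shown in the animation (value set earlier)
frames_per_second <- 2.5  # playback speed (value set earlier)

n_frames <- n_draws * 2                       # matches nframes = n_draws * 2
ms_per_frame <- 1000 / frames_per_second      # duration of one frame
total_seconds <- n_frames / frames_per_second # length of one full loop

print(ms_per_frame)   # 400
print(total_seconds)  # 80
```

Changing `n_draws` or `frames_per_second` changes the running time accordingly, which is worth keeping in mind if the animations will be embedded in a talk or a paper supplement.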